Conversation
@microsoft-github-policy-service agree
Pull request overview
This PR adds Apache Arrow fetch support to the mssql-python driver, enabling efficient columnar data retrieval from SQL Server. The implementation provides three new cursor methods (arrow_batch(), arrow(), and arrow_reader()) that convert result sets into Apache Arrow data structures using the Arrow C Data Interface, bypassing Python object creation in the hot path for improved performance.
Key changes:
- Implemented Arrow fetch functionality in C++ that directly converts ODBC result sets to Arrow format
- Added three Python API methods for different Arrow data consumption patterns (single batch, full table, streaming reader)
- Added comprehensive test coverage for various data types, LOB columns, and edge cases
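For context, the Arrow C Data Interface mentioned above is a stable C ABI defined by the Arrow specification: a producer fills the following two structs and hands them to pyarrow, which takes ownership via the release callbacks. The definitions below are quoted from the Arrow spec, not from this PR's code.

```cpp
// Arrow C Data Interface struct definitions (from the Apache Arrow spec).
// A producer such as FetchArrowBatch_wrap fills these and exports them
// (in this PR, via PyCapsules); the consumer calls release() when done.
#include <cstdint>

struct ArrowSchema {
    const char* format;        // type encoding, e.g. "l" = int64, "u" = utf8
    const char* name;          // column name
    const char* metadata;
    int64_t flags;             // e.g. nullable flag
    int64_t n_children;
    struct ArrowSchema** children;
    struct ArrowSchema* dictionary;
    void (*release)(struct ArrowSchema*);  // frees everything the producer allocated
    void* private_data;
};

struct ArrowArray {
    int64_t length;            // number of rows
    int64_t null_count;
    int64_t offset;
    int64_t n_buffers;         // validity / offsets / data buffers
    int64_t n_children;
    const void** buffers;
    struct ArrowArray** children;
    struct ArrowArray* dictionary;
    void (*release)(struct ArrowArray*);
    void* private_data;
};
```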
Reviewed changes
Copilot reviewed 3 out of 4 changed files in this pull request and generated 9 comments.
| File | Description |
|---|---|
| mssql_python/pybind/ddbc_bindings.cpp | Core C++ implementation: Added FetchArrowBatch_wrap() function with Arrow C Data Interface structures, column buffer management, data type conversion logic, and memory management for Arrow structures |
| mssql_python/cursor.py | Python API layer: Added arrow_batch(), arrow(), and arrow_reader() methods that wrap the C++ bindings and handle pyarrow imports |
| tests/test_004_cursor.py | Comprehensive test suite covering wide tables, LOB columns, individual data types, empty result sets, datetime handling, and batch operations |
| requirements.txt | Added pyarrow as a dependency for development and testing |
Hi @ffelixg, thanks for raising this PR. Please allow us time to review and share our comments. Appreciate your diligence in strengthening this project.
Sumit
Hello @ffelixg, my team and I are in the process of reviewing your PR. While we get started, it would be great to have some preliminary information from you on the following items:
Regards,
Hello @sumitmsft, I'm happy to hear that.
Regards,
/azp run
Azure Pipelines successfully started running 1 pipeline(s).
bewithgaurav
left a comment
@ffelixg - Thanks for the contribution! :)
Before we get started on this PR, there are a few dev build workflows we need to fix.
Could you please take a look at the Azure DevOps workflows that are failing (they go by the check MSSQL-Python-PR-Validation)?
Hey, the Windows issue was due to me using a 128-bit integer type, which isn't supported by MSVC. To address that, I've added a custom 128-bit int type implemented as two 64-bit ints with some overloaded operators. I'm not super happy about that; it seems to me that using an existing library for this kind of thing would be better. If you prefer to use a library, I'd leave the choice of which one up to you, though. The 128-bit type is only needed for decimals, so an alternative solution would be to use the numeric struct instead of strings for fetching. That struct has a near-identical bit representation to Arrow's and wouldn't require a lot of modification. But that's a change that would affect fetching to Python objects as well, since the two paths should probably stay in sync.
The macOS issue seems to be due to std::mktime failing. I've added an implementation of days_from_civil to eliminate that call (see the sketch below). I think a newer C++ standard includes that function. CPython also has an implementation for that in
I noticed some CI errors related to
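For reference, a minimal sketch of the well-known days_from_civil algorithm (Howard Hinnant's civil-date algorithms; C++20's std::chrono provides an equivalent). This is an illustration of the approach, not necessarily the exact code in the PR:

```cpp
#include <cstdint>

// Days since the Unix epoch (1970-01-01) for a proleptic Gregorian date.
// Works for dates before 1970 as well, with no reliance on std::mktime
// or the local timezone.
constexpr int64_t days_from_civil(int64_t y, unsigned m, unsigned d) {
    y -= m <= 2;                                                          // treat Jan/Feb as months 13/14 of the prior year
    const int64_t era = (y >= 0 ? y : y - 399) / 400;
    const unsigned yoe = static_cast<unsigned>(y - era * 400);            // year of era [0, 399]
    const unsigned doy = (153 * (m + (m > 2 ? -3 : 9)) + 2) / 5 + d - 1;  // day of year, March-based [0, 365]
    const unsigned doe = yoe * 365 + yoe / 4 - yoe / 100 + doy;           // day of era [0, 146096]
    return era * 146097 + static_cast<int64_t>(doe) - 719468;             // 719468 days from 0000-03-01 to 1970-01-01
}

static_assert(days_from_civil(1970, 1, 1) == 0, "epoch check");
static_assert(days_from_civil(2000, 3, 1) == 11017, "leap-year check");
```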
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…ring ownership to arrow
/azp run
Azure Pipelines successfully started running 1 pipeline(s).
📊 Code Coverage Report
Diff coverage (main...HEAD, staged and unstaged changes)
Summary of uncovered lines in the diff:
- mssql_python/cursor.py: lines 35-43, 2412-2425, 2440-2449, 2467-2476
- mssql_python/pybind/ddbc_bindings.cpp: lines 4230-4239, 4258-4267, 4266-4275, 4276-4284, 4290-4299, 4326-4336, 4371-4379, 4418-4426, 4438-4446, 4455-4464, 4468-4482, 4511-4521, 4533-4543, 4560-4570, 4576-4590, 4593-4603, 4607-4617, 4621-4631, 4635-4645, 4649-4675, 4678-4688, 4693-4703, 4709-4719, 4723-4733, 4737-4751, 4753-4763, 4767-4777, 4781-4800, 4809-4817, 4830-4840, 4843-4852, 4850-4858, 4857-4866, 4864-4873, 4900-4909, 4933-4941, 4956-4964, 4995-5004, 5006-5014, 5024-5040, 5116-5131, 5214-5229
📉 Files with overall lowest coverage:
- mssql_python.pybind.logger_bridge.hpp: 58.8%
- mssql_python.pybind.logger_bridge.cpp: 59.2%
- mssql_python.row.py: 66.2%
- mssql_python.helpers.py: 67.5%
- mssql_python.pybind.ddbc_bindings.h: 71.7%
- mssql_python.pybind.ddbc_bindings.cpp: 72.8%
- mssql_python.pybind.connection.connection.cpp: 73.6%
- mssql_python.ddbc_bindings.py: 79.6%
- mssql_python.pybind.connection.connection_pool.cpp: 79.6%
- mssql_python.connection.py: 83.9%
/azp run
Azure Pipelines successfully started running 1 pipeline(s).
gargsaumya
left a comment
This is a really solid contribution. The Arrow C Data Interface implementation is well structured, the PyCapsule ownership chain is handled correctly with RAII and proper try/catch guards, and the test coverage is comprehensive across a wide range of SQL types.
I’ve added a few non-blocking suggestions for improvement.
Thanks, @ffelixg , for the great work on this! 🎉
```cpp
case SQL_BIGINT:
    arrowColumnProducer->int64Val[idxRowArrow] = buffers.bigIntBuffers[idxCol][idxRowSql];
    break;
case SQL_REAL:
```
In the LOB fetch path (line 4660), SQL_REAL is correctly fetched into buffers.realBuffers as SQLREAL (a 4-byte float). But here SQL_REAL falls through with SQL_FLOAT/SQL_DOUBLE and reads from buffers.doubleBuffers, an 8-byte double buffer that was never populated for this column. This can produce garbage values for any real column whenever a LOB column is also present in the result set.
SQL_REAL should probably get its own case.
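A minimal sketch of that suggestion, reusing the identifiers from the snippet above and assuming the Arrow column for SQL_REAL is still produced as float64 (as the current fall-through implies), so widening float to double is lossless:

```cpp
case SQL_REAL:
    // Read from the 4-byte buffer the fetch path actually populated.
    arrowColumnProducer->float64Val[idxRowArrow] =
        static_cast<double>(buffers.realBuffers[idxCol][idxRowSql]);
    break;
case SQL_FLOAT:
case SQL_DOUBLE:
    arrowColumnProducer->float64Val[idxRowArrow] = buffers.doubleBuffers[idxCol][idxRowSql];
    break;
```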
```python
        A pyarrow RecordBatch object containing up to batch_size rows.
        """
        self._check_closed()  # Check if the cursor is closed
        if not self._has_result_set and self.description:
```
arrow_batch checks self._has_result_set and calls self._reset_rownumber() when it's False, but never sets it to True afterward. This means every call in a loop (e.g., from arrow()) hits _reset_rownumber() repeatedly.
Other fetch methods (fetchone, fetchmany, etc.) set self._has_result_set = True after the first successful fetch. Should this do the same?
```python
        A pyarrow Table containing all remaining rows from the result set.
        """
        try:
            import pyarrow
```
arrow_batch() calls self._check_closed() at the top, but arrow() and arrow_reader() don't. If pyarrow isn't installed and the cursor is closed, the user may get ImportError instead of the more helpful "cursor is closed" error.
Maybe we should add self._check_closed() before the import pyarrow in both methods?
```
coverage
unittest-xml-reporting
psutil
pyarrow
```
The Python code already handles ImportError gracefully with a helpful message. Should this be under an optional extras group in setup.py (e.g., pip install mssql-python[arrow]) instead of being a hard requirement for all users?
For the test file it's fine, but this requirements.txt drives the project's general dependencies, not just test deps.
@sumitmsft @bewithgaurav thoughts?
```cpp
std::unique_ptr<int64_t[]> int64Val;
std::unique_ptr<double[]> float64Val;
std::unique_ptr<uint8_t[]> bitVal;
std::unique_ptr<uint32_t[]> varVal;
```
varVal is uint32_t[], making the offset buffer a 32-bit offset array. As per the Arrow spec, this means variable-length columns are capped at ~4 GB of data per batch. That's probably fine for typical use, but it could overflow silently with large batch_size + large varchar(max) data.
For example, if someone fetches 8,192 rows and each row contains a 1 MB varchar, that results in ~8 GB of data. In that case, the uint32 offsets would wrap around silently, and pyarrow would end up reading corrupted data.
Would it make sense to add a runtime check to ensure the total accumulated byte size does not exceed UINT32_MAX? Please correct me if I’m misunderstanding anything here.
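A minimal sketch of such a guard, with hypothetical names (checkOffsetOverflow, currentByteOffset, and valueLen are illustrative, not the PR's actual variables); it would run before each variable-length value is appended:

```cpp
#include <cstdint>
#include <limits>
#include <stdexcept>

// Fail loudly instead of letting 32-bit Arrow offsets wrap around silently.
inline void checkOffsetOverflow(uint64_t currentByteOffset, uint64_t valueLen) {
    if (currentByteOffset + valueLen > std::numeric_limits<uint32_t>::max()) {
        throw std::runtime_error(
            "Arrow batch exceeds 4 GB of variable-length data; "
            "reduce batch_size or switch to 64-bit (large) offsets");
    }
}
```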
Work Item / Issue Reference
Summary
Hey, you mentioned in issue #130 that you were willing to consider community contributions for adding Apache Arrow support, so here you go. I have focused only on fetching data from the database into Arrow structures.
The function signatures I chose are:
- `arrow_batch(chunk_size=10000)`: fetches a single `pyarrow.RecordBatch`; the base for the other two methods.
- `arrow(chunk_size=10000)`: fetches the entire result set as a single `pyarrow.Table`.
- `arrow_reader(chunk_size=10000)`: returns a `pyarrow.RecordBatchReader` for streaming results without loading the entire dataset into RAM.

Using `fetch_arrow...` instead of just `arrow...` could also be a good option, but I think the terse version is not too ambiguous.
Technical details
I am not very familiar with C++, but I did have some prior practice for this task from implementing my own ODBC driver in Zig (a very good language for projects like this!). The implementation is written almost entirely in C++ in the `FetchArrowBatch_wrap` function, which produces PyCapsules that are then consumed by `arrow_batch` and turned into actual Arrow objects.
The function itself is very large. I'm sure it could be factored in a better way, even sharing some code with the other fetch methods, but my goal was to keep the whole thing as straightforward as possible.
I have also implemented my own loop around SQLGetData for LOB columns. Unlike with the Python fetch methods, I don't use the result directly, but instead copy it into the same buffer I would use in the bound-column case. Maybe that's an abstraction that would make sense for that case as well.
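A rough sketch of the shape of such a SQLGetData loop (an illustration assuming SQL_C_BINARY chunks; the PR's actual code differs in buffer handling, wide-character targets, and error reporting):

```cpp
#include <sql.h>      // plus windows.h first on Windows
#include <sqlext.h>
#include <vector>

// Drain one LOB column cell in fixed-size chunks and append to a growable buffer.
std::vector<char> fetchLobCell(SQLHSTMT hStmt, SQLUSMALLINT colNum) {
    std::vector<char> out;
    char chunk[8192];
    SQLLEN indicator = 0;
    SQLRETURN rc;
    // SQLGetData returns SQL_SUCCESS_WITH_INFO (SQLSTATE 01004) while more
    // data remains, SQL_SUCCESS on the final chunk, SQL_NO_DATA afterwards.
    while ((rc = SQLGetData(hStmt, colNum, SQL_C_BINARY,
                            chunk, sizeof(chunk), &indicator)) != SQL_NO_DATA) {
        if (!SQL_SUCCEEDED(rc)) break;          // real error: caller should raise
        if (indicator == SQL_NULL_DATA) break;  // NULL cell, nothing to copy
        size_t got = sizeof(chunk);
        if (indicator != SQL_NO_TOTAL && static_cast<size_t>(indicator) < sizeof(chunk))
            got = static_cast<size_t>(indicator);  // final, partial chunk
        out.insert(out.end(), chunk, chunk + got);
        if (rc == SQL_SUCCESS) break;           // column fully read
    }
    return out;
}
```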
Notes on data types
I noticed that you use SQL_C_TYPE_TIME for time(x) columns. The arrow fetch does the same, but I think it would be better to use SQL_C_SS_TIME2, since that supports fractional seconds.
Datetimeoffset is a bit tricky, since SQL Server stores timezone information alongside each cell, while arrow tables expect a fixed timezone for the entire column. I don't really see any solution other than converting everything to UTC and returning a UTC column, so that's what I did.
SQL_C_CHAR columns get copied directly into arrow utf8 arrays. Maybe some encoding options would be useful.
Performance
I think the main performance win to be had is not interacting with any Python data structures in the hot path, and that is satisfied. Further optimizations, which I did not make, include:
- Instead of looping over rows and columns and then switching on the data type for each cell, you could dispatch on the type once per column and run a tight copy loop over the rows (a sketch of this idea follows below).
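For illustration, a minimal sketch of that per-column dispatch (hypothetical names such as copyColumn, srcBigInt, and outInt64 are illustrative, not the PR's actual variables): the type switch runs once per column and the inner loop becomes a tight copy.

```cpp
#include <sql.h>      // SQLSMALLINT, SQLULEN (plus windows.h first on Windows)
#include <sqlext.h>   // SQL_BIGINT
#include <cstdint>

// Copy one already-fetched column into its Arrow value buffer,
// dispatching on the SQL type only once instead of once per cell.
void copyColumn(SQLSMALLINT columnSqlType, SQLULEN rowsFetched,
                const int64_t* srcBigInt, const double* srcDouble,
                int64_t* outInt64, double* outFloat64) {
    switch (columnSqlType) {
        case SQL_BIGINT:
            for (SQLULEN row = 0; row < rowsFetched; ++row)
                outInt64[row] = srcBigInt[row];
            break;
        case SQL_DOUBLE:
            for (SQLULEN row = 0; row < rowsFetched; ++row)
                outFloat64[row] = srcDouble[row];
            break;
        // ...one tight loop per supported SQL type
        default:
            break;
    }
}
```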
Overall the arrow performance seems not too far off from what I achieved with zodbc.